This paper presents extended techniques aiming to improve automatic speech recognition (ASR) in single-channel scenarios in the context of the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. The focus is on the development and analysis of ASR front-end technologies covering speech enhancement and feature extraction. Speech enhancement is performed by a joint noise reduction and dereverberation system in the spectral domain, based on estimates of the noise and late reverberation power spectral densities (PSDs). To obtain reliable estimates of the PSDs, even in acoustic conditions with positive direct-to-reverberation energy ratios (DRRs), we adopt a statistical model of the room impulse response that explicitly incorporates the DRR, in combination with a newly proposed joint estimator for the reverberation time T60 and the DRR. The feature extraction approach is inspired by processing strategies of the auditory system: an amplitude modulation filterbank is applied to extract temporal modulation information. These techniques were shown to improve the REVERB baseline in our previous work. Here, we investigate whether similar improvements are obtained when using a state-of-the-art ASR framework, and to what extent the results depend on the specific architecture of the back-end. Apart from conventional Gaussian mixture model (GMM)-hidden Markov model (HMM) back-ends, we consider subspace GMM (SGMM)-HMMs as well as deep neural networks in a hybrid system. The speech enhancement algorithm is found to be helpful in almost all conditions, the exception being deep learning systems in matched training-test conditions. The auditory feature type improves the baseline for all system architectures. Combining our front-end techniques with current back-ends yields an average relative word error rate reduction of 52.7% on the REVERB evaluation test set, compared to our original REVERB result.
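To make the enhancement step concrete, the following is a minimal sketch, not the paper's exact algorithm, of the kind of spectral-domain suppression rule described above: a Lebart-style late-reverberation PSD estimate derived from T60, attenuated by a DRR-dependent factor, combined with a noise PSD in a power spectral subtraction gain. The function names, the delay parameter t_l, and the DRR scaling kappa are illustrative assumptions.

```python
import numpy as np

def late_reverb_psd(noisy_psd, t60, drr_db, t_l=0.08, frame_shift=0.016):
    """Lebart-style late-reverberation PSD estimate (illustrative sketch).

    Assumes an exponentially decaying reverberant tail with decay rate
    delta = 3*ln(10)/T60, delayed by t_l seconds, and scaled by a simple
    DRR-dependent factor (our simplification; the paper's statistical
    RIR model is richer). noisy_psd: array of shape (frames, bins).
    """
    delta = 3.0 * np.log(10.0) / t60             # RIR energy decay constant
    n_l = max(1, int(round(t_l / frame_shift)))  # prediction delay in frames
    drr_lin = 10.0 ** (drr_db / 10.0)
    kappa = 1.0 / (1.0 + drr_lin)                # less late reverb at high DRR (assumption)
    psd_late = np.zeros_like(noisy_psd)
    psd_late[n_l:] = kappa * np.exp(-2.0 * delta * t_l) * noisy_psd[:-n_l]
    return psd_late

def suppression_gain(noisy_psd, noise_psd, late_psd, g_min=0.1):
    """Joint noise reduction / dereverberation gain via power spectral subtraction."""
    gain = 1.0 - (noise_psd + late_psd) / np.maximum(noisy_psd, 1e-12)
    return np.maximum(gain, g_min)               # gain floor limits musical noise
```

In use, the gain would be applied per time-frequency bin to the short-time spectrum of the noisy signal before resynthesis; the gain floor is a common practical choice rather than part of the estimators themselves.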
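Likewise, a rough sketch of the amplitude-modulation-filterbank idea behind the feature extraction: band-pass filtering the temporal trajectory of each log mel band around a handful of modulation frequencies. The modulation center frequencies and the simple Butterworth band-pass design below are our illustrative choices, not the paper's exact filterbank.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def amfb_features(log_mel, mod_centers=(4.0, 8.0, 16.0), frame_rate=100.0):
    """Simplified amplitude-modulation-filterbank features (illustrative sketch).

    log_mel: array (frames, mel_bands) of log mel-band energies.
    Each mel-band trajectory is band-pass filtered around each modulation
    center frequency; the filtered outputs are stacked along the feature axis.
    """
    nyq = frame_rate / 2.0
    feats = []
    for fc in mod_centers:
        lo = max(0.5, 0.5 * fc) / nyq            # normalized lower band edge
        hi = min(nyq - 1.0, 1.5 * fc) / nyq      # normalized upper band edge
        b, a = butter(2, [lo, hi], btype="bandpass")
        feats.append(filtfilt(b, a, log_mel, axis=0))  # filter along time
    return np.concatenate(feats, axis=1)         # (frames, mel_bands * len(mod_centers))
```

The zero-phase filtering (filtfilt) is a convenience for the sketch; an online system would use causal modulation filters instead.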